Subspace Regularized Dynamic Time Warping for Spoken Query Detection

Authors

  • Dhananjay Ram
  • Afsaneh Asaei
  • Hervé Bourlard
Abstract

Dynamic time warping (DTW) is an algorithm for measuring the similarity between two temporal sequences of varying length. Work in this field can be traced back as early as [1], where DTW was applied to automatic speech recognition (ASR). Although the technique became obsolete for ASR with the advent of hidden Markov models (HMM) [2] and hybrid models based on deep neural networks (DNN) [3], [4], DTW was found to be highly effective for spoken query detection, the task of searching for a spoken query within an audio document. The key distinction is that, unlike HMM and DNN solutions, which require a large amount of annotated data to train the models, DTW can operate in low-resource conditions where training data is scarce. DTW-based systems are therefore the state-of-the-art solutions for spoken query detection using one or a few examples of the query. The traditional DTW algorithm performs an end-to-end comparison between two temporal sequences. This is not directly applicable to spoken query search, because the query can occur anywhere in the test audio as a sub-sequence. Variants of DTW such as segmental DTW [5] and sub-sequence DTW [6] were therefore developed to address this limitation. To use these methods, phone posterior features are extracted [8] from the speech data, as shown in Fig. 1. Given a query and an audio document, a distance matrix is computed between their phone posterior representations, where each element of the matrix represents a frame-level distance. A dynamic programming technique then finds an optimal alignment between the frames of the query and the test audio. Although the methods discussed above take the sequential information in a spoken query into account, they do not consider the low-dimensional subspace structure of speech.
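As a rough illustration of the distance matrix and alignment step described above, the following Python sketch computes frame-level distances between phone posterior vectors and runs a simplified sub-sequence DTW. The distance measure (negative log inner product), the step pattern, and all function names are assumptions made for illustration; the abstract does not specify them, and this is not the system evaluated in this work.

```python
import math


def frame_distance(p, q, eps=1e-10):
    """Negative log inner product of two posterior vectors (assumed metric)."""
    return -math.log(max(sum(a * b for a, b in zip(p, q)), eps))


def distance_matrix(query, test):
    """D[i][j] is the distance between query frame i and test frame j."""
    return [[frame_distance(q, t) for t in test] for q in query]


def subsequence_dtw(D):
    """Align the whole query against any contiguous region of the test audio.

    Returns (accumulated_cost, start_frame, end_frame) of the best region.
    """
    n, m = len(D), len(D[0])
    acc = [[0.0] * m for _ in range(n)]
    start = [[0] * m for _ in range(n)]
    # First query frame: a match may begin at any test frame at no extra cost.
    for j in range(m):
        acc[0][j] = D[0][j]
        start[0][j] = j
    for i in range(1, n):
        acc[i][0] = acc[i - 1][0] + D[i][0]
        start[i][0] = start[i - 1][0]
        for j in range(1, m):
            # Cheapest predecessor among diagonal, vertical and horizontal steps.
            best_cost, best_start = min(
                (acc[i - 1][j - 1], start[i - 1][j - 1]),
                (acc[i - 1][j], start[i - 1][j]),
                (acc[i][j - 1], start[i][j - 1]),
            )
            acc[i][j] = best_cost + D[i][j]
            start[i][j] = best_start
    # The match may end at any test frame: take the cheapest cell of the last row.
    end = min(range(m), key=lambda j: acc[n - 1][j])
    return acc[n - 1][end], start[n - 1][end], end


# Toy posteriors over three phone classes (made-up data).
a, b, c = [0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]
cost, first, last = subsequence_dtw(distance_matrix([a, b], [c, c, a, b, c]))
print(first, last)  # the query "a b" is located at test frames 2 and 3
```

Leaving the first row unaccumulated is what lets the query start anywhere in the test audio; an end-to-end DTW would instead force the path to begin at the first test frame and end at the last.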
Previously, we proposed a novel sparse subspace modeling approach for query detection that exploits this property of speech [7], [8]: we construct two dictionaries for sparse representation, characterizing the subspaces of the query and of background speech independently. The sparse recovery reconstruction error is used as the score for query detection. To incorporate sequential information, adjacent frames were concatenated to perform frame-level detection. However, this approach lacks a proper framework for exploiting the temporal information inherent to spoken queries. We observe that the two kinds of systems discussed above use complementary information present in speech to perform the same task. To take advantage of both, we propose a new DTW technique that considers the subspace structure of speech. The method relies on the notion that a spoken query lies in a low-dimensional subspace and can be represented as a sparse linear combination of the corresponding training data. The training examples of the query are used to construct a dictionary for sparse representation, which models the query subspace. This dictionary can be used to obtain a sparse representation of each test audio frame, from which a reconstruction error is calculated for that frame [9]. The error for a test frame can be regarded as the distance between the query subspace and that frame. We propose to use this subspace-based distance to regularize the distance matrix for DTW. Each column of the distance matrix holds the frame-level distances between one test frame and all frames of the query, whereas the subspace-based distance is a single number representing the distance from that test frame to the query subspace as a whole. Thus, to regularize the distance matrix, we take the column corresponding to a test frame and replace each of its elements with a weighted average of that element and the subspace-based distance obtained for the same test frame.
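The regularization described above can be sketched in Python as follows. Two points are assumptions rather than details from the abstract: greedy matching pursuit stands in for the sparse recovery solver, which is not named here, and the blend is controlled by a single hypothetical parameter `weight`. The function names are illustrative as well.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def subspace_distance(frame, dictionary, n_atoms=2):
    """Reconstruction error of `frame` from at most `n_atoms` dictionary atoms.

    Uses greedy matching pursuit as a stand-in for the sparse solver.
    """
    residual = list(frame)
    for _ in range(n_atoms):
        best_atom, best_coef, best_gain = None, 0.0, 0.0
        for atom in dictionary:
            norm_sq = dot(atom, atom)
            if norm_sq == 0.0:
                continue
            corr = dot(residual, atom)
            gain = corr * corr / norm_sq  # energy removed by projecting on atom
            if gain > best_gain:
                best_atom, best_coef, best_gain = atom, corr / norm_sq, gain
        if best_atom is None:  # no atom explains any of the residual
            break
        residual = [r - best_coef * a for r, a in zip(residual, best_atom)]
    return math.sqrt(dot(residual, residual))


def regularize(D, subspace_dist, weight=0.5):
    """Blend every column j of D with the subspace distance of test frame j.

    D[i][j] becomes (1 - weight) * D[i][j] + weight * subspace_dist[j].
    """
    if len(subspace_dist) != len(D[0]):
        raise ValueError("need one subspace distance per test frame (column)")
    return [
        [(1.0 - weight) * d + weight * e for d, e in zip(row, subspace_dist)]
        for row in D
    ]


# Toy usage: the query frames themselves act as the dictionary.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
test = [[0.6, 0.4, 0.0], [0.0, 0.0, 1.0]]
errors = [subspace_distance(t, query) for t in test]
D = [[1.0, 2.0], [3.0, 4.0]]  # placeholder frame-level distances
print(regularize(D, errors, weight=0.5))
```

With `weight=0` the matrix is unchanged and plain DTW results; with `weight=1` every column collapses to the subspace distance and the frame-level temporal information is lost, so a value in between would be tuned on development data.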
We then perform dynamic programming on this regularized distance matrix to obtain the region of occurrence of the query and calculate the likelihood of its occurrence. A comprehensive block diagram of the proposed system is presented in Fig. 2. The key idea behind the proposed method is that the frame-level distance provides local similarity and helps capture the temporal information inherent to speech, whereas the subspace-based distance captures similarity at the subspace level, considering all the frames of the query for each test frame. A combination of these two distances provides better likelihoods for making a decision, as the performance improvement shows. In principle, our approach can work with any variant of DTW by regularizing the corresponding distance matrix. In this work, however, we implement the system presented in [10] and apply the proposed regularization to its distance matrix, followed by dynamic programming to obtain the region of occurrence along with a likelihood score. The system in [10] is based on segmental DTW and is one of the best systems available for this task, so we use it as the baseline for comparison. We performed spoken query detection experiments on the AMI meeting corpus [11] to show the potential of our approach. There are approximately 12k words in the training data, of which 200 frequent words are used as queries in our detection experiments. These queries are divided into two sets of 100 queries each, serving as development and evaluation queries. The feature vectors of each query serve as the dictionary for sparse coding as well as the reference template for DTW. The system parameters are optimized using the development queries. The results on the evaluation queries are shown as detection error trade-off (DET) curves in Fig. 3. The plots in Fig. 3 show that our proposed system significantly outperforms the baseline system. These results further show that the low-dimensional subspace structure of speech can be very useful for spoken query detection.
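For readers unfamiliar with DET curves, the sketch below shows how the operating points of such a curve can be computed from detection scores and ground-truth labels. It is a generic illustration with made-up data, not the evaluation code behind Fig. 3; the function name and the convention that higher scores mean more likely detections are assumptions (a DTW cost would be negated first).

```python
def det_points(scores, labels):
    """Return (false_alarm_rate, miss_rate) pairs, one per score threshold.

    A trial counts as detected when its score is >= the threshold.
    `labels` holds 1 for true occurrences of the query and 0 otherwise.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative trial")
    points = []
    for thr in sorted(set(scores)):
        misses = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        points.append((false_alarms / n_neg, misses / n_pos))
    return points


# Toy scores: two true occurrences and two non-occurrences.
for fa, miss in det_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]):
    print(f"false alarm rate {fa:.2f}, miss rate {miss:.2f}")
```

A DET plot usually draws these pairs on normal-deviate axes; a system whose curve lies closer to the origin makes fewer errors at every operating point.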

Related articles

Subspace Detection of DNN Posterior Probabilities via Sparse Representation for Query by Example Spoken Term Detection

We cast the query by example spoken term detection (QbESTD) problem as subspace detection where query and background subspaces are modeled as union of low-dimensional subspaces. The speech exemplars used for subspace modeling are class-conditional posterior probabilities estimated using deep neural network (DNN). The query and background training exemplars are exploited to model the underlying ...

Unsupervised Hidden Markov Modeling of Spoken Queries for Spoken Term Detection without Speech Recognition

We propose an unsupervised technique to model the spoken query using hidden Markov model (HMM) for spoken term detection without speech recognition. By unsupervised segmentation, clustering and training, a set of HMMs, referred to as acoustic segment HMMs (ASHMMs), is generated from the spoken archive to model the signal variations and frame trajectories. An unsupervised technique is also desig...

DTW-Distance-Ordered Spoken Term Detection and STD-based Spoken Content Retrieval: Experiments at NTCIR-10 SpokenDoc-2

In this paper, we report our experiments at NTCIR-10 SpokenDoc-2 task. We participated both the STD and SCR subtasks of SpokenDoc. For STD subtask, we applied novel indexing method, called metric subspace indexing, previously proposed by us. One of the distinctive advantages of the method was that it could output the detection results in increasing order of distance without using any predefined...

Query-by-Example Spoken Term Detection

This paper aims at a search in a large speech database with zero or low-resource languages by spoken term example in a spoken utterance. The data can not be recognized by Automatic Speech Recognition system due to a lack of resources. A modern method for searching patterns in speech called Query-by-Example is investigated. This technique exploits a well-known dynamic programming approach named ...

IIIT-H SWS 2013: Gaussian Posteriorgrams of Bottle-Neck Features for Query-by-Example Spoken Term Detection

This paper describes the experiments conducted for spoken web search (SWS) at MediaEval 2013 evaluations. A conventional approach is to train a multi-layer perceptron using high resource languages and then use it in the low resource scenario. However, phone posteriorgrams have been found to under-perform when the language they were trained on differs from the target language. In this paper, we ...

Publication date: 2017